This script generates tables and figures for quality control (QC) of electronic health record (EHR) data. It processes the NLP and codified datasets to check that the data are consistent and reliable before analysis.

Requirements

You should have two datasets (NLP & codified) that include at least:

Additionally, you need a data dictionary containing:

Setup

Define the following in Module 1:

Note that if you are not running this code on O2, you will need to download the codified and NLP feature dictionaries from the ONCE webapp and manually specify their directory paths.

Module 0: Sample QC Data

This optional module samples 1,000 unique patients from the intersection of the NLP and codified datasets to speed up QC processing.

Outputs

  1. Sampled NLP dataset (to be used in Module 1)
  2. Sampled codified dataset (to be used in Module 1)
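
The sampling step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it assumes each dataset is a list of row dictionaries and that patients are identified by a column named `patient_num`. The function name, the column name, and the `seed` argument are all hypothetical.

```python
import random


def sample_shared_patients(nlp_rows, codified_rows, n=1000, seed=0,
                           id_key="patient_num"):
    """Return the rows of both datasets for up to n patients found in both.

    `id_key` is a hypothetical patient-identifier column name.
    """
    nlp_ids = {row[id_key] for row in nlp_rows}
    codified_ids = {row[id_key] for row in codified_rows}

    # Intersection: patients present in both datasets.
    # Sorting makes the draw reproducible for a given seed.
    shared = sorted(nlp_ids & codified_ids)

    rng = random.Random(seed)
    chosen = set(rng.sample(shared, min(n, len(shared))))

    # Keep every row belonging to a sampled patient.
    nlp_sample = [row for row in nlp_rows if row[id_key] in chosen]
    codified_sample = [row for row in codified_rows if row[id_key] in chosen]
    return nlp_sample, codified_sample
```

Sampling patients rather than rows keeps each sampled patient's full record in both datasets, so the two samples cover the same set of patients.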

Module 1: Data and Dictionary Import & Preparation

This module imports and prepares the NLP and codified datasets for analysis. It also imports three data dictionaries: an institution-specific data dictionary with codified feature descriptions (user-defined) and two ONCE dictionaries used to select features similar to the target PheCode and CUI (loaded automatically from O2).

Outputs

  1. Cleaned NLP and codified datasets
  2. Filtered data dictionary with target and common code feature descriptions
  3. ONCE dictionaries with selected codified and NLP features

Module 2: Patient, Code, and Follow-up Summaries

This module summarizes the NLP and codified datasets, including patient counts, prevalence of the target PheCode and CUI, and duration of patient follow-up. Patient counts are summarized annually (line plots) and overall (bar plots).

Outputs

  1. Patient counts over time
  2. Table of follow-up duration statistics

Total Sample Size

  Dataset    Number of Patients
  NLP        1000
  Codified   1000
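
As a rough illustration of the summaries this module produces, the sketch below counts unique patients per calendar year and computes follow-up duration statistics. It is not the script's own code. It assumes each row carries a patient identifier (`patient_num`) and a `datetime.date` (`start_date`), and that follow-up is measured from a patient's first record to their last. These names and that definition of follow-up are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median


def patients_per_year(rows, id_key="patient_num", date_key="start_date"):
    """Count unique patients with at least one record in each calendar year."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[date_key].year].add(row[id_key])
    return {year: len(ids) for year, ids in sorted(seen.items())}


def follow_up_years(rows, id_key="patient_num", date_key="start_date"):
    """Follow-up per patient in years, assumed to run from first to last record."""
    first, last = {}, {}
    for row in rows:
        pid, day = row[id_key], row[date_key]
        first[pid] = min(first.get(pid, day), day)
        last[pid] = max(last.get(pid, day), day)
    return {pid: (last[pid] - first[pid]).days / 365.25 for pid in first}


def summarize(values):
    """Basic descriptive statistics for a non-empty collection of durations."""
    values = list(values)
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Toy rows for illustration only; not real patient data.
    rows = [
        {"patient_num": 1, "start_date": date(2010, 1, 1)},
        {"patient_num": 1, "start_date": date(2012, 1, 1)},
        {"patient_num": 2, "start_date": date(2010, 6, 1)},
    ]
    print(patients_per_year(rows))
    print(summarize(follow_up_years(rows).values()))
```

The per-year counts correspond to the annual line plots, and the total number of distinct patients corresponds to the overall bar plots and the sample size table.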